204 research outputs found
WISER: A Semantic Approach for Expert Finding in Academia based on Entity Linking
We present WISER, a new semantic search engine for expert finding in
academia. Our system is unsupervised and it jointly combines classical language
modeling techniques, based on textual evidence, with the Wikipedia Knowledge
Graph, via entity linking.
WISER indexes each academic author through a novel profiling technique which
models her expertise with a small, labeled and weighted graph drawn from
Wikipedia. Nodes in this graph are the Wikipedia entities mentioned in the
author's publications, whereas the weighted edges express the semantic
relatedness among these entities computed via textual and graph-based
relatedness functions. Every node is also labeled with a relevance score which
models the pertinence of the corresponding entity to the author's expertise, and is
computed by means of a proper random-walk calculation over that graph; and with
a latent vector representation which is learned via entity and other kinds of
structural embeddings derived from Wikipedia.
At query time, experts are retrieved by combining classic document-centric
approaches, which exploit the occurrences of query terms in the author's
documents, with a novel set of profile-centric scoring strategies, which
compute the semantic relatedness between the author's expertise and the query
topic via the above graph-based profiles.
The effectiveness of our system is established over a large-scale
experimental test on a standard dataset for this task. We show that WISER
achieves better performance than all the other competitors, thus proving the
effectiveness of modeling an author's profile via our "semantic" graph of
entities. Finally, we comment on the use of WISER for indexing and profiling
the whole research community within the University of Pisa, and its application
to technology transfer in our University.
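The random-walk relevance score described in the WISER abstract can be illustrated with a minimal sketch. It assumes a plain weighted PageRank computed by power iteration; the abstract does not specify the exact random-walk variant, so the function name, the damping factor, and the toy entity graph below are illustrative assumptions, not the authors' implementation.

```python
def random_walk_scores(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration over an undirected entity graph.

    `edges` maps (entity_a, entity_b) -> positive relatedness weight.
    All names and parameters here are hypothetical, chosen for illustration.
    """
    # Build a symmetric weighted adjacency map.
    neighbors = {}
    for (a, b), weight in edges.items():
        neighbors.setdefault(a, {})[b] = weight
        neighbors.setdefault(b, {})[a] = weight

    nodes = sorted(neighbors)
    n = len(nodes)
    scores = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the same teleportation mass.
        new_scores = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            total = sum(neighbors[u].values())
            for v, weight in neighbors[u].items():
                # A walker at u moves to v with probability weight / total.
                new_scores[v] += damping * scores[u] * weight / total
        scores = new_scores
    return scores


# Toy profile: three closely related entities and one weakly related outlier.
profile = {
    ("Data compression", "Burrows-Wheeler transform"): 0.9,
    ("Data compression", "Suffix array"): 0.6,
    ("Suffix array", "Burrows-Wheeler transform"): 0.8,
    ("Data compression", "Soccer"): 0.05,
}
relevance = random_walk_scores(profile)
```

In this toy profile the weakly connected entity receives the lowest score, which matches the intuition that entities loosely related to the rest of the profile are less pertinent to the author's expertise.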
On optimally partitioning a text to improve its compression
In this paper we investigate the problem of partitioning an input string T in
such a way that compressing individually its parts via a base-compressor C gets
a compressed output that is shorter than applying C over the entire T at once.
This problem was introduced in the context of table compression, and then
further elaborated and extended to strings and trees. Unfortunately, the
literature offers poor solutions: namely, we know either a cubic-time algorithm
for computing the optimal partition based on dynamic programming, or a few
heuristics that do not guarantee any bounds on the efficacy of their computed
partition, or algorithms that are efficient but work in some specific scenarios
(such as the Burrows-Wheeler Transform) and achieve compression performance
that might be worse than the optimal-partitioning by a
factor. Therefore, computing efficiently the optimal solution is still open. In
this paper we provide the first algorithm which is guaranteed to compute in
O(n \log_{1+\eps} n) time a partition of T whose compressed output is
guaranteed to be no more than (1+\eps)-worse than the optimal one, where \eps
may be any positive constant.
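The dynamic-programming baseline mentioned in the abstract can be sketched as follows. This is the slow exact approach, not the paper's approximation algorithm; it assumes zlib as the base compressor C, and the function names are hypothetical.

```python
import zlib


def compressed_size(chunk: bytes) -> int:
    """Size in bytes of `chunk` under the assumed base compressor C (zlib)."""
    return len(zlib.compress(chunk, 9))


def optimal_partition(text: bytes):
    """Exact optimal partition by dynamic programming.

    best[i] is the minimum total compressed size over all partitions of
    text[:i]. It tries O(n^2) substrings and compresses each one, so it is
    only practical for short inputs.
    """
    n = len(text)
    best = [0] + [float("inf")] * n
    cut = [0] * (n + 1)  # cut[i] = start of the last part in the best split of text[:i]
    for i in range(1, n + 1):
        for j in range(i):
            cost = best[j] + compressed_size(text[j:i])
            if cost < best[i]:
                best[i] = cost
                cut[i] = j

    # Walk the cut points backwards to recover the parts.
    parts = []
    i = n
    while i > 0:
        j = cut[i]
        parts.append(text[j:i])
        i = j
    parts.reverse()
    return best[n], parts


sample = b"a" * 60 + bytes(range(60))
total, parts = optimal_partition(sample)
```

Because the whole string is itself one candidate partition, the returned total can never exceed the size obtained by compressing the input at once. With zlib the fixed per-stream overhead often makes a single part optimal on short inputs.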
Compressed Text Indexes: From Theory to Practice!
A compressed full-text self-index represents a text in a compressed form and
still answers queries efficiently. This technology represents a breakthrough
over the text indexing techniques of the previous decade, whose indexes
required several times the size of the text. Although it is relatively new,
this technology has matured up to a point where theoretical research is giving
way to practical developments. Nonetheless this requires significant
programming skills, a deep engineering effort, and a strong algorithmic
background to dig into the research results. To date only isolated
implementations and focused comparisons of compressed indexes have been
reported, and they missed a common API, which prevented their re-use or
deployment within other applications.
The goal of this paper is to fill this gap. First, we present the existing
implementations of compressed indexes from a practitioner's point of view.
Second, we introduce the Pizza&Chili site, which offers tuned implementations
and a standardized API for the most successful compressed full-text
self-indexes, together with effective testbeds and scripts for their automatic
validation and test. Third, we show the results of our extensive experiments on
these codes with the aim of demonstrating the practical relevance of this novel
and exciting technology.
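To make the idea of a self-index concrete, here is a minimal sketch of pattern counting by backward search over the Burrows-Wheeler Transform, the mechanism behind FM-index style self-indexes. It is deliberately uncompressed and uses naive construction; it does not reproduce the Pizza&Chili API, and all names are hypothetical. It assumes the text does not contain the NUL character, which is used as the sentinel.

```python
def build_fm_index(text: str):
    """Build the tables needed for backward search (naive, uncompressed)."""
    text += "\0"  # unique sentinel, smaller than every other character
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each suffix in sorted order.
    bwt = [text[i - 1] for i in suffix_array]

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # smaller[c] = number of characters in the text strictly smaller than c.
    smaller, running = {}, 0
    for ch in sorted(counts):
        smaller[ch] = running
        running += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return smaller, occ, len(bwt)


def count(index, pattern: str) -> int:
    """Count occurrences of a non-empty pattern without scanning the text."""
    smaller, occ, n = index
    lo, hi = 0, n  # current range of suffixes, half-open
    for ch in reversed(pattern):
        if ch not in smaller:
            return 0
        lo = smaller[ch] + occ[ch][lo]
        hi = smaller[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("abracadabra")
```

A practical compressed index would replace the plain `occ` table with a compressed rank structure over the BWT; the query logic stays the same.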
PlayeRank: data-driven performance evaluation and player ranking in soccer via a machine learning approach
The problem of evaluating the performance of soccer players is attracting the
interest of many companies and the scientific community, thanks to the
availability of massive data capturing all the events generated during a match
(e.g., tackles, passes, shots, etc.). Unfortunately, there is no consolidated
and widely accepted metric for measuring performance quality in all of its
facets. In this paper, we design and implement PlayeRank, a data-driven
framework that offers a principled multi-dimensional and role-aware evaluation
of the performance of soccer players. We build our framework by deploying a
massive dataset of soccer-logs consisting of millions of match events
pertaining to four seasons of 18 prominent soccer competitions. By comparing
PlayeRank to known algorithms for performance evaluation in soccer, and by
exploiting a dataset of players' evaluations made by professional soccer
scouts, we show that PlayeRank significantly outperforms the competitors. We
also explore the ratings produced by PlayeRank and discover interesting
patterns about the nature of excellent performances and what distinguishes the
top players from the others. Finally, we explore some applications of
PlayeRank (i.e., searching players and player versatility), showing its
flexibility and efficiency, which make it worth using in the design of a
scalable platform for soccer analytics.
Bicriteria data compression
The advent of massive datasets (and the consequent design of high-performing
distributed storage systems) has reignited the interest of the scientific and
engineering community towards the design of lossless data compressors which
achieve effective compression ratio and very efficient decompression speed.
Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because of
its decompression speed and its flexibility in trading decompression speed
versus compressed-space efficiency. Each of the existing implementations offers
a trade-off between space occupancy and decompression speed, so software
engineers have to content themselves with picking the one which comes closest to
the requirements of the application in their hands. Starting from these
premises, and for the first time in the literature, we address in this paper
the problem of trading optimally, and in a principled way, the consumption of
these two resources by introducing the Bicriteria LZ77-Parsing problem, which
formalizes in a principled way what data-compressors have traditionally
approached by means of heuristics. The goal is to determine an LZ77 parsing
which minimizes the space occupancy in bits of the compressed file, provided
that the decompression time is bounded by a fixed amount (or vice-versa). This
way, the software engineer can set their space (or time) requirements and then
derive the LZ77 parsing which optimizes the decompression speed (or the space
occupancy, respectively). We solve this problem efficiently in O(n log^2 n)
time and optimal linear space within a small, additive approximation, by
proving and deploying some specific structural properties of the weighted graph
derived from the possible LZ77-parsings of the input file. The preliminary set
of experiments shows that our novel proposal dominates all the highly
engineered competitors, hence offering a win-win situation in theory and practice.
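The bicriteria formulation can be illustrated on tiny inputs with an exhaustive dynamic program over the graph of possible LZ77 parsings. This is not the paper's O(n log^2 n) algorithm: it is a pseudo-polynomial sketch, and the per-phrase costs (bits and time units for a literal and for a copy) are hypothetical values chosen only to expose the trade-off.

```python
def bicriteria_parse_bits(text, time_budget, literal=(9, 1), copy=(12, 4)):
    """Fewest bits of an LZ77 parsing whose decoding time fits the budget.

    `literal` and `copy` are assumed (space_bits, time_units) costs per phrase.
    Returns None when no parsing meets the time budget.
    """
    n = len(text)
    INF = float("inf")
    # space[i][t] = fewest bits to encode text[:i] using exactly t time units.
    space = [[INF] * (time_budget + 1) for _ in range(n + 1)]
    space[0][0] = 0
    for i in range(n):
        # Outgoing edges of node i: one literal, plus every copy phrase
        # text[i:j] of length >= 2 that also starts somewhere before i.
        edges = [(i + 1, literal)]
        j = i + 2
        while j <= n and text.find(text[i:j]) < i:
            edges.append((j, copy))
            j += 1
        for t in range(time_budget + 1):
            if space[i][t] == INF:
                continue
            for end, (bits, time) in edges:
                if t + time <= time_budget:
                    candidate = space[i][t] + bits
                    if candidate < space[end][t + time]:
                        space[end][t + time] = candidate
    best = min(space[n])
    return None if best == INF else best


sizes = {b: bicriteria_parse_bits("ababcdcd", b) for b in (7, 8, 10, 12)}
```

With these assumed costs, `sizes` is `{7: None, 8: 72, 10: 66, 12: 60}`: a tighter time budget forces the parser to replace compact but slow copies with larger but faster literals.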
The PGM-index: a multicriteria, compressed and learned approach to data indexing
The recent introduction of learned indexes has shaken the foundations of the
decades-old field of indexing data structures. Combining, or even replacing,
classic design elements such as B-tree nodes with machine learning models has
proven to give outstanding improvements in the space footprint and time
efficiency of data systems. However, these novel approaches are based on
heuristics, thus they lack any guarantees both in their time and space
requirements. We propose the Piecewise Geometric Model index (PGM-index for
short), which achieves guaranteed I/O-optimality in query operations, learns an
optimal number of linear models, and, thanks to its peculiar recursive
construction, is a purely learned data structure rather than a hybrid of
traditional and learned indexes (such as RMI and FITing-tree). We show that the
PGM-index improves the space of the FITing-tree by 63.3% and of the B-tree by
more than four orders of magnitude, while achieving the same or even better
query time efficiency. We complement this result by proposing three variants of
the PGM-index. First, we design a compressed PGM-index that further reduces its
space footprint by exploiting the repetitiveness at the level of the learned
linear models it is composed of. Second, we design a PGM-index that adapts
itself to the distribution of the queries, thus resulting in the first known
distribution-aware learned index to date. Finally, given its flexibility in the
offered space-time trade-offs, we propose the multicriteria PGM-index that
efficiently auto-tunes itself in a few seconds over hundreds of millions of keys
to the possibly evolving space-time constraints imposed by the application of
use.
We remark to the reader that this paper is an extended and improved version
of our previous paper titled "Superseding traditional indexes by orchestrating
learning and geometry" (arXiv:1903.00507).
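The core idea of a learned index built from linear models with a bounded error can be sketched as follows. This is a single-level simplification that uses a greedy "shrinking cone" segmentation, which does not guarantee the optimal number of models and omits the recursive construction of the PGM-index; the function names and the choice of error bound are assumptions. It expects a non-empty list of strictly increasing integer keys.

```python
from bisect import bisect_left, bisect_right


def build_segments(keys, eps):
    """Greedy piecewise linear approximation of key -> position.

    Each segment (first_key, slope, first_pos) predicts the position of every
    key it covers with an error of at most `eps`.
    """
    def pick_slope(lo, hi):
        return 0.0 if hi == float("inf") else (lo + hi) / 2

    segments = []
    k0, p0 = keys[0], 0
    lo, hi = 0.0, float("inf")  # feasible slope range of the current segment
    for p in range(1, len(keys)):
        dx, dy = keys[p] - k0, p - p0
        new_lo = max(lo, (dy - eps) / dx)
        new_hi = min(hi, (dy + eps) / dx)
        if new_lo <= new_hi:
            lo, hi = new_lo, new_hi  # the point fits: shrink the cone
        else:
            segments.append((k0, pick_slope(lo, hi), p0))
            k0, p0 = keys[p], p  # start a new segment at this point
            lo, hi = 0.0, float("inf")
    segments.append((k0, pick_slope(lo, hi), p0))
    return segments


def lookup(keys, segments, eps, key):
    """Return the position of `key` in `keys`, or None if it is absent."""
    s = bisect_right([seg[0] for seg in segments], key) - 1
    if s < 0:
        return None
    first_key, slope, first_pos = segments[s]
    predicted = int(first_pos + slope * (key - first_key))
    # Search only a window of about 2 * eps positions around the prediction.
    lo = min(len(keys), max(0, predicted - eps - 1))
    hi = min(len(keys), predicted + eps + 2)
    i = bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else None


keys = [i * i for i in range(1, 500)]
eps = 4
segments = build_segments(keys, eps)
```

The segment list is rebuilt into a key array on every lookup here for brevity; a real index would store it once and, in the PGM-index, index the segments recursively with further linear models.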